Goto

Collaborating Authors

 ct 2



Noise-Augmented $\ell_0$ Regularization of Tensor Regression with Tucker Decomposition

Yan, Tian, Li, Yinan, Liu, Fang

arXiv.org Machine Learning

Tensor data are multi-dimension arrays. Low-rank decomposition-based regression methods with tensor predictors exploit the structural information in tensor predictors while significantly reducing the number of parameters in tensor regression. We propose a method named NA$_0$CT$^2$ (Noise Augmentation for $\ell_0$ regularization on Core Tensor in Tucker decomposition) to regularize the parameters in tensor regression (TR), coupled with Tucker decomposition. We establish theoretically that NA$_0$CT$^2$ achieves exact $\ell_0$ regularization in linear TR and generalized linear TR on the core tensor from the Tucker decomposition. To our knowledge, NA$_0$CT$^2$ is the first Tucker decomposition-based regularization method in TR to achieve $\ell_0$ in core tensor. NA$_0$CT$^2$ is implemented through an iterative procedure and involves two simple steps in each iteration -- generating noisy data based on the core tensor from the Tucker decomposition of the updated parameter estimate and running a regular GLM on noise-augmented data on vectorized predictors. We demonstrate the implementation of NA$_0$CT$^2$ and its $\ell_0$ regularization effect in both simulation studies and real data applications. The results suggest that NA$_0$CT$^2$ improves predictions compared to other decomposition-based TR approaches, with or without regularization and it also helps to identify important predictors though not designed for that purpose.


Adaptive Gradient Methods with Dynamic Bound of Learning Rate

Luo, Liangchen, Xiong, Yuanhao, Liu, Yan, Sun, Xu

arXiv.org Machine Learning

Adaptive optimization methods such as AdaGrad, RMSprop and Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates. Though prevailing, they are observed to generalize poorly compared with SGD or even fail to converge due to unstable and extreme learning rates. Recent work has put forward some algorithms such as AMSGrad to tackle this issue but they failed to achieve considerable improvement over existing methods. In our paper, we demonstrate that extreme learning rates can lead to poor performance. We provide new variants of Adam and AMSGrad, called AdaBound and AMSBound respectively, which employ dynamic bounds on learning rates to achieve a gradual and smooth transition from adaptive methods to SGD and give a theoretical proof of convergence. We further conduct experiments on various popular tasks and models, which is often insufficient in previous work. Experimental results show that new variants can eliminate the generalization gap between adaptive methods and SGD and maintain higher learning speed early in training at the same time. Moreover, they can bring significant improvement over their prototypes, especially on complex deep networks. The implementation of the algorithm can be found at https://github.com/Luolc/AdaBound .


Scalable Distributed Monte-Carlo Tree Search

Yoshizoe, Kazuki (The University of Tokyo) | Kishimoto, Akihiro (Tokyo Institute of Technology and Japan Science and Technology Agency) | Kaneko, Tomoyuki (The University of Tokyo) | Yoshimoto, Haruhiro (The University of Tokyo) | Ishikawa, Yutaka (The University of Tokyo)

AAAI Conferences

Monte-Carlo Tree Search (MCTS) is remarkably successful in two-player games, but parallelizing MCTS has been notoriously difficult to scale well, especially in distributed environments. For a distributed parallel search, transposition-table driven scheduling (TDS) is known to be efficient in several domains. We present a massively parallel MCTS algorithm, that applies the TDS parallelism to the Upper Confidence bound Applied to Trees (UCT) algorithm, which is the most representative MCTS algorithm. To drastically decrease communication overhead, we introduce a reformulation of UCT called Depth-First UCT. The parallel performance of the algorithm is evaluated on clusters using up to 1,200 cores in artificial game-trees. We show that this approach scales well, achieving 740-fold speedups in the best case.